Multi-Armed Bandits in Discrete and Continuous Time

Authors

  • Haya Kaspi
  • Avishai Mandelbaum
Abstract

We analyze Gittins's Markovian model, as generalized by Varaiya, Walrand and Buyukkoc, in discrete and continuous time. The approach resembles Weber's modification of Whittle's, within the framework of both multi-parameter processes and excursion theory. It is shown that index-priority strategies are optimal, in concert with all the special cases that have been treated previously.

1. Introduction.

A multi-armed bandit is a control model that supports dynamic allocation of scarce resources in the face of uncertainty [2, 3, 8, 15]. Each arm of the bandit represents an ongoing project, and pulling arms corresponds to allocating resources among the projects. In a discrete-time model, arms are pulled one at a time and each pull results in a reward. This is in contrast to continuous time, where the more appropriate view is that of a resource (time, effort) to be allocated simultaneously among the arms while rewards accrue continuously. The goal is to identify optimal allocation strategies, and in this paper it is achieved for bandits with independent arms and random rewards, discounted over an infinite horizon.

Specifically, we analyze Gittins's Markovian model [9, 8] in discrete and continuous time, as generalized by [18] and [13] in the spirit of [17]. The approach resembles Weber's [20] modification of [21] (see also [5-7]), within the framework of both multi-parameter processes [12, 13] and excursion theory [11]. It differs from [6, 7], which take a martingale-based approach. The outcome is a rigorous proof that is shorter and, in our opinion, conceptually clearer than its predecessors, both in discrete time [18, 5, 3, 12] and especially in continuous time [6, 7, 13, 14]. Also of interest is the connection with general excursion theory [1]; see, for example, the index representation (33), which generalizes (4.3) in [11] beyond a Markovian setting.

The continuous-time model is formulated and its solution presented in Section 2. One could view discrete-time bandits as a special case of continuous time, in which rewards and information change only on a discrete set of predictable epochs. Nevertheless, Section 3 constitutes a self-contained treatment in discrete time: being short and accessible, it provides an introduction to the solution in continuous time, highlighting the main ideas without the (unavoidable) technicalities that obscure them. Properties of the index process are developed in Section 4 and used, in Section 5, to solve the multi-armed bandit problem.
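For readers meeting the model here for the first time (this note and the sketch below are ours, not part of the paper): in the discrete-time Markovian model, each state x of an arm carries a Gittins index nu(x) = sup over stopping times tau >= 1 of E[ sum_{t<tau} beta^t r(X_t) | X_0 = x ] / E[ sum_{t<tau} beta^t | X_0 = x ], and the index-priority strategy, which always pulls an arm whose current state has maximal index, is optimal. A minimal sketch, assuming finite-state arms and computing the indices via the restart-in-state characterization of Katehakis and Veinott (1987), nu(x) = (1 - beta) V_x(x), where V_x is the value of an auxiliary problem that may always restart the arm at x; the function name and the two example arms are illustrative, not taken from the paper:

```python
import numpy as np

def gittins_indices(P, r, beta, tol=1e-10, max_iter=100_000):
    """Gittins index of every state of one arm, via the restart-in-state
    formulation.  P is the n x n transition matrix, r the length-n reward
    vector, 0 < beta < 1 the discount factor."""
    n = len(r)
    nu = np.zeros(n)
    for x in range(n):
        V = np.zeros(n)
        for _ in range(max_iter):
            cont = r + beta * (P @ V)          # value of continuing from each state
            V_new = np.maximum(cont, cont[x])  # ...or restarting the arm at x
            if np.max(np.abs(V_new - V)) < tol:
                V = V_new
                break
            V = V_new
        nu[x] = (1.0 - beta) * V[x]            # index = normalized restart value
    return nu

# Two hypothetical 2-state arms: (transition matrix, reward vector).
beta = 0.9
arms = [
    (np.array([[0.7, 0.3], [0.4, 0.6]]), np.array([1.0, 0.0])),
    (np.array([[0.5, 0.5], [0.2, 0.8]]), np.array([0.8, 0.3])),
]
indices = [gittins_indices(P, r, beta) for P, r in arms]

# Index-priority rule: always pull the arm whose current state has the
# largest Gittins index; all other arms stay frozen.
rng = np.random.default_rng(0)
states, total, disc = [0, 0], 0.0, 1.0
for _ in range(200):
    i = max(range(len(arms)), key=lambda k: indices[k][states[k]])
    P, r = arms[i]
    total += disc * r[states[i]]
    states[i] = rng.choice(len(r), p=P[states[i]])
    disc *= beta
print("discounted reward under the index-priority rule:", total)
```

Value iteration in the restart problem converges because the update is a beta-contraction; the paper's contribution is, in part, extending optimality of this index rule well beyond such finite Markovian settings.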

Similar articles

Unimodal Bandits: Regret Lower Bounds and Optimal Algorithms

We consider stochastic multi-armed bandits where the expected reward is a unimodal function over partially ordered arms. This important class of problems has been recently investigated in (Cope, 2009; Yu & Mannor, 2011). The set of arms is either discrete, in which case arms correspond to the vertices of a finite graph whose structure represents similarity in rewards, or continuous, in which ca...

Lipschitz Bandits: Regret Lower Bound and Optimal Algorithms

We consider stochastic multi-armed bandit problems where the expected reward is a Lipschitz function of the arm, and where the set of arms is either discrete or continuous. For discrete Lipschitz bandits, we derive asymptotic problem-specific lower bounds for the regret satisfied by any algorithm, and propose OSLB and CKL-UCB, two algorithms that efficiently exploit the Lipschitz structure of t...
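The excerpt breaks off before describing OSLB and CKL-UCB. Purely as background, and not as either of those algorithms, here is a minimal sketch of the classical UCB1 rule of Auer, Cesa-Bianchi and Fischer (2002), the standard upper-confidence-bound baseline that such methods refine; the Bernoulli instance at the end is made up for illustration:

```python
import math
import random

def ucb1(pull, n_arms, horizon):
    """Classical UCB1: try each arm once, then pull the arm maximizing
    empirical mean + sqrt(2 ln t / n_i).  Rewards are assumed in [0, 1]."""
    counts = [0] * n_arms
    means = [0.0] * n_arms
    for t in range(1, horizon + 1):
        if t <= n_arms:
            i = t - 1                      # initialization: one pull per arm
        else:
            i = max(range(n_arms),
                    key=lambda k: means[k] + math.sqrt(2 * math.log(t) / counts[k]))
        x = pull(i)                        # observe a bounded reward
        counts[i] += 1
        means[i] += (x - means[i]) / counts[i]  # running-mean update
    return means, counts

# A made-up Bernoulli instance: arm 2 is best, with mean 0.7.
p = [0.3, 0.5, 0.7]
means, counts = ucb1(lambda i: float(random.random() < p[i]), 3, 10_000)
print(means, counts)
```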

Budgeted Bandit Problems with Continuous Random Costs

We study the budgeted bandit problem, where each arm is associated with both a reward and a cost. In a budgeted bandit problem, the objective is to design an arm-pulling algorithm in order to maximize the total reward before the budget runs out. In this work, we study both multi-armed bandits and linear bandits, and focus on the setting with continuous random costs. We propose an upper confiden...

A Note on Bandits with a Twist

A variant of the multi-armed bandit problem was recently introduced by Dumitriu, Tetali and Winkler. For this model (and a mild generalization) we propose faster algorithms to compute the Gittins index. The indexability of such models follows from earlier work of Nash on generalized bandits.

Publication date: 1998